In [1]:
import datashader as ds
import datashader.transfer_functions as tf
import datashader.glyphs
from datashader import reductions
from datashader.core import bypixel
from datashader.utils import lnglat_to_meters as webm, export_image
from datashader.colors import colormap_select, Greys9, viridis, inferno
import copy

from functools import partial

from pyproj import Proj, transform
import numpy as np
import pandas as pd
import urllib
import json
import datetime
import colorlover as cl

import plotly.offline as py
import plotly.graph_objs as go
from plotly import tools
import plotly.express as px

import plotly.io as pio
pio.templates.default = "plotly_white"


# from shapely.geometry import Point, Polygon, shape
# To get shapely, you'll need to run [pip install shapely] from your terminal


from IPython.display import GeoJSON

py.init_notebook_mode()

For module 2 we'll be looking at techniques for dealing with big data: in particular, binning strategies and the datashader library (which may well mean we'll never need to bin large data for visualization again).

To demonstrate these concepts we'll be looking at the PLUTO dataset put out by New York City's department of city planning. PLUTO contains data about every tax lot in New York City.

PLUTO data can be downloaded from here. Unzip the files to the same directory as this notebook, and you should be able to read them in using this (or very similar) code. Also take note of the data dictionary; it'll come in handy for this assignment.

In [2]:
# Code to read in v17, column names have been updated (without upper case letters) for v18

# bk = pd.read_csv('PLUTO17v1.1/BK2017V11.csv')
# bx = pd.read_csv('PLUTO17v1.1/BX2017V11.csv')
# mn = pd.read_csv('PLUTO17v1.1/MN2017V11.csv')
# qn = pd.read_csv('PLUTO17v1.1/QN2017V11.csv')
# si = pd.read_csv('PLUTO17v1.1/SI2017V11.csv')

# ny = pd.concat([bk, bx, mn, qn, si], ignore_index=True)

ny = pd.read_csv('./pluto_22v3.csv')


# Getting rid of some outliers
ny = ny[(ny['yearbuilt'] > 1850) & (ny['yearbuilt'] < 2020) & (ny['numfloors'] != 0)]
/var/folders/hk/4s9whdq11vl_8vwj80pdhlqw0000gn/T/ipykernel_992/1914043934.py:11: DtypeWarning:

Columns (21,22,24,26,28) have mixed types. Specify dtype option on import or set low_memory=False.
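The warning above can be avoided by declaring dtypes up front, as pandas suggests. A minimal sketch with a toy in-memory CSV (the column numbers and names in the real PLUTO file will differ):

```python
import io
import pandas as pd

# A toy CSV where one column mixes numeric-looking and alphabetic codes,
# which is what triggers pandas' DtypeWarning on large chunked reads.
csv = io.StringIO("bbl,landuse\n1,05\n2,RP\n3,02\n")

# Declaring the column's dtype up front avoids mixed-type inference
# (alternatively, pass low_memory=False to read the file in one pass).
df = pd.read_csv(csv, dtype={"landuse": str})
print(df["landuse"].tolist())  # ['05', 'RP', '02']
```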

I'll also do some prep for the geographic component of this data, which we'll be relying on for datashader.

You're not required to know how I'm retrieving the latitude and longitude here, but for those interested: this dataset uses a flat x-y projection (assuming, for a small enough area, that the world is flat for easier calculations), and this needs to be projected back to traditional latitude and longitude.

In [3]:
# wgs84 = Proj("+proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs")
# nyli = Proj("+proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs")
# ny['xcoord'] = 0.3048*ny['xcoord']
# ny['ycoord'] = 0.3048*ny['ycoord']
# ny['lon'], ny['lat'] = transform(nyli, wgs84, ny['xcoord'].values, ny['ycoord'].values)

# ny = ny[(ny['lon'] < -60) & (ny['lon'] > -100) & (ny['lat'] < 60) & (ny['lat'] > 20)]

#Defining some helper functions for DataShader
background = "black"
export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))

Part 1: Binning and Aggregation¶

Binning is a common strategy for visualizing large datasets. Binning is inherent to a few types of visualizations, such as histograms and 2D histograms (also check out their close relatives, 2D density plots, and the more general form, heatmaps).
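The binning a histogram performs can be sketched directly with numpy (toy year values, not the PLUTO data):

```python
import numpy as np

# Hypothetical "year built" values binned by decade.
years = np.array([1901, 1905, 1912, 1912, 1918, 1923])

# Three decade-wide bins: [1900, 1910), [1910, 1920), [1920, 1930]
counts, edges = np.histogram(years, bins=[1900, 1910, 1920, 1930])
print(counts)  # [2 3 1]
```

A histogram chart is just these counts drawn as bars; everything downstream of the binning is presentation.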

While these visualization types explicitly include binning, any visualization of aggregated data can be looked at in the same way. For example, let's say we wanted to look at building construction over time. This would be best viewed as a line graph, but we can still think of our results as being binned by year:

In [4]:
trace = go.Scatter(
    # I'm choosing BBL here because I know it's a unique key.
    x = ny.groupby('yearbuilt').count()['bbl'].index,
    y = ny.groupby('yearbuilt').count()['bbl']
)

layout = go.Layout(
    xaxis = dict(title = 'Year Built'),
    yaxis = dict(title = 'Number of Lots Built')
)

fig = go.FigureWidget(data = [trace], layout = layout)

fig
FigureWidget({
    'data': [{'type': 'scatter',
              'uid': '2ac57696-c4c9-4c1f-a0b4-c34c38ec1fcb',
 …

Something looks off... You're going to have to deal with this imperfect data to answer this first question.

But first: some notes on pandas. Pandas dataframes are a different beast than R dataframes, here are some tips to help you get up to speed:



Indexing and Selecting: .loc and .iloc are the analogs for base R subsetting, or filter() in dplyr
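A minimal sketch of the difference, on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"borough": ["BK", "MN", "QN"], "numfloors": [2, 40, 3]})

# .loc selects by label or boolean mask, like base-R subsetting or dplyr::filter()
tall = df.loc[df["numfloors"] > 10, "borough"]
print(tall.tolist())  # ['MN']

# .iloc is purely positional
print(df.iloc[0]["borough"])  # 'BK'
```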

Group By: This is the pandas analog to group_by(), and the appended function is the analog to summarize(). Try out a few examples of this, and display the results in Jupyter. Take note of what's happening to the indexes: you'll notice that they become hierarchical. Once you perform an aggregation, try running the resulting hierarchical dataframe through a reset_index().

Reset_index: I personally find hierarchical indexes more of a burden than a help, and this sort of hierarchical indexing leads to a fundamentally different experience compared to R dataframes. reset_index() is a way of restoring a dataframe to a flatter index style. Grouping is where you'll notice it the most, but it's also useful when you filter data and in a few other split-apply-combine workflows. With pandas, indexes are more meaningful, so use this if you start getting unexpected results.
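For instance, on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"borough": ["BK", "BK", "MN"], "numfloors": [2, 3, 40]})

# groupby + an aggregation moves the grouping column into the index...
g = df.groupby("borough")["numfloors"].mean()

# ...and reset_index() flattens it back into ordinary columns.
flat = g.reset_index()
print(flat.columns.tolist())       # ['borough', 'numfloors']
print(flat["numfloors"].tolist())  # [2.5, 40.0]
```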

Indexes are more important in pandas than in R. If you delve deeper into using Python for data science, you'll begin to see the benefits in many places (despite the personal gripes I highlighted above). One place these indexes come in handy is with time series data. The pandas docs have a huge section on datetime indexing. In particular, check out resample, which provides time-series-specific aggregation.
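A small resample sketch (a toy daily series, not the PLUTO data):

```python
import pandas as pd

# A tiny daily series, aggregated to monthly sums via its DatetimeIndex.
idx = pd.date_range("2020-01-30", periods=4, freq="D")
s = pd.Series([1, 2, 3, 4], index=idx)

# "MS" buckets by month (labeled at month start): Jan gets 1+2, Feb gets 3+4.
monthly = s.resample("MS").sum()
print(monthly.tolist())  # [3, 7]
```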

Merging, joining, and concatenation: There's some overlap between these different types of merges, so use this as your guide. Concat is a single function that replaces cbind and rbind in R, and the results are driven by the indexes. Read through these examples to get a feel for how these are performed, but you will have to manage your indexes when using these functions. Merges are fairly similar to merges in R, similarly mapping to SQL joins.
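On toy frames, for example:

```python
import pandas as pd

a = pd.DataFrame({"bbl": [1, 2], "borough": ["BK", "MN"]})
b = pd.DataFrame({"bbl": [3], "borough": ["QN"]})

# concat stacks rows like rbind; ignore_index=True rebuilds a flat 0..n-1 index
stacked = pd.concat([a, b], ignore_index=True)
print(len(stacked))  # 3

# merge maps to SQL-style joins, much like merge() in R
values = pd.DataFrame({"bbl": [1, 3], "assesstot": [100, 300]})
joined = stacked.merge(values, on="bbl", how="inner")
print(joined["borough"].tolist())  # ['BK', 'QN']
```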

Apply: This is explained in the "group by" section linked above. These are your analogs to the plyr library in R. Take note of the lambda syntax used here; these are anonymous functions in Python. Rather than predefining a custom function, you can just define it inline using lambda.
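For example, a floor-bucketing rule written inline (mirroring the decade-style binning used later in this notebook, on toy values):

```python
import pandas as pd

df = pd.DataFrame({"numfloors": [2, 45, 80]})

# An inline lambda instead of a predefined function: bucket floors into tens
df["floorgroup"] = df["numfloors"].apply(lambda n: (n // 10) * 10)
print(df["floorgroup"].tolist())  # [0, 40, 80]
```

(For a simple arithmetic rule like this, the vectorized form `(df.numfloors // 10) * 10` is faster; apply earns its keep when the logic doesn't vectorize.)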

Browse through the other sections for some other specifics, in particular reshaping and categorical data (pandas' answer to factors). Pandas can take a while to get used to, but it is a pretty strong framework that makes more advanced functions easier once you get the hang of it. Rolling functions, for example, follow logically from the apply workflow (and led to the best Google results ever when I first tried to find this out and googled "pandas rolling").
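A toy rolling example:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# A rolling mean over a 3-wide window; the first two entries are NaN
# because the window isn't full yet.
roll = s.rolling(window=3).mean()
print(roll.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```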

Google Wes McKinney's book "Python for Data Analysis," which is a cookbook-style intro to pandas. It's an O'Reilly book that should be pretty easy to find.


Question¶

After a few building collapses, the City of New York is going to begin investigating older buildings for safety. The city is particularly worried about buildings that were unusually tall when they were built, since best practices for safety hadn't yet been determined. Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). Find a strategy to bin buildings (it should be clear that 20-29-story, 30-39-story, and 40-49-story buildings were first built in large numbers, but does it make sense to continue in this way as you get taller?)

Start your answer here, inserting more cells as you go along¶

Part 1: Visualization¶

The first approach to understanding which buildings to focus on seeks to uncover these insights by aggregating the PLUTO data set. The graph above shows that the year-built data seems to lack accuracy and might be survey-based. To account for this, we group the year-built data by decade. To simplify the visualization, the number of floors in each building is also grouped in bins of 10. And finally, since there are so few buildings with more than 80 floors, buildings with 80 floors and above are grouped into one category.

In [5]:
ny_df = ny[['borough','block','lot','yearbuilt','numfloors']].copy()
ny_df.loc[ny_df['numfloors']  >= 80,'numfloors'] = 80
ny_df = ny_df[~pd.isnull(ny_df['numfloors'])]
ny_df['decbuilt'] = (ny_df.yearbuilt//10)*10
ny_df['floorgroup'] = (ny_df.numfloors//10)*10
ny_df.head()
Out[5]:
borough block lot yearbuilt numfloors decbuilt floorgroup
1 BK 3435 45 1901.0 2.0 1900.0 0.0
2 BK 3447 29 1910.0 2.0 1910.0 0.0
3 BX 2514 10 1899.0 2.0 1890.0 0.0
4 MN 482 7501 1900.0 7.0 1900.0 0.0
5 BK 3434 8 1899.0 2.0 1890.0 0.0
In [6]:
ny_df.shape
Out[6]:
(809446, 7)

The next step in the process is counting the buildings that fall into each decade-built group and number-of-floors group. The resulting data set has fewer than 100 rows and is much easier to manage from a visualization perspective.

In [7]:
ny_df = ny_df.groupby(['decbuilt','floorgroup'])[['numfloors']].count().sort_values('decbuilt',ascending=True)
ny_df = ny_df.rename(columns={"numfloors": "count"})
ny_df['count_log'] =  np.log2(ny_df['count'])
ny_df = ny_df.reset_index()
ny_df = ny_df.sort_values(['decbuilt','floorgroup'])
ny_df
Out[7]:
decbuilt floorgroup count count_log
0 1850.0 0.0 1567 10.613789
1 1850.0 10.0 2 1.000000
2 1860.0 0.0 1553 10.600842
3 1860.0 10.0 4 2.000000
4 1870.0 0.0 2764 11.432542
... ... ... ... ...
92 2010.0 40.0 46 5.523562
90 2010.0 50.0 26 4.700440
91 2010.0 60.0 15 3.906891
93 2010.0 70.0 6 2.584963
98 2010.0 80.0 2 1.000000

99 rows × 4 columns

In [8]:
ny_df.shape
Out[8]:
(99, 4)

The first visualization uses a scatterplot to view the summarized data. To highlight areas of focus, the number of floors drives both the color and the size of each data point. The log of the building count is used to add clarity to the y-axis.

Key Findings:

  • First 30-40-floor buildings were built in the 1880s
  • First 70+-floor building appeared in the early 1900s
  • First 80+-floor building appeared in the 1920s
  • Rise in the number of 50-60-floor buildings in the 1920s
  • Rise in the number of 60-70-floor buildings in the 1930s
  • Rise in the number of 70-80-floor buildings in the 1980s

These indicate particular years to focus on for specific types of buildings: for example, buildings over 70 floors built prior to 1980, and buildings over 80 floors in general, since there are so few buildings that size.

This question brings to mind a recent architectural crisis with one of the buildings in Midtown: https://en.wikipedia.org/wiki/Citicorp_Center_engineering_crisis

In [9]:
fig = px.scatter(data_frame=ny_df, 
                x='decbuilt', y='count_log',
                color="floorgroup", 
                size="floorgroup",
                width=1000, height=600, 
                title='Building (by Year)',   
            )

fig.show()

The bar graph visualization of the same data highlights similar findings in an easier-to-digest format.

In [10]:
fig = px.bar(ny_df, x="decbuilt", y="count_log", 
             width=1000, height=500,
             color='floorgroup', 
             title='NY Building (by Year)')
fig.show()

This is a slightly less elegant visual, but it provides a clear outline of when a particular building height started to gain prominence in New York City. It is somewhat intuitive, but it is interesting to see the clear delineation of when buildings of a certain height were developed.

  • Buildings over 20 stories were not common before the turn of the century
  • There was a brief halt in building over 30 floors during the Second World War
  • There are some interesting gaps in the development of buildings over 60 floors

The benefit of this visualization is that it allows the user to focus on individual buildings and clusters of buildings that stand out from the population.

In [11]:
ny0_df = ny[['borough','block','lot','yearbuilt','numfloors']].copy()
ny0_df.loc[ny0_df['numfloors']  >= 80,'numfloors'] = 80
ny0_df['decbuilt'] = (ny0_df.yearbuilt//10)*10
ny0_df['floorgroup'] = (ny0_df.numfloors//10)*10

ny0_df = ny0_df[~pd.isnull(ny0_df['floorgroup'])]
ny0_df = ny0_df.sort_values('floorgroup', ascending=True)

ny0_df.head()
Out[11]:
borough block lot yearbuilt numfloors decbuilt floorgroup
1 BK 3435 45 1901.0 2.0 1900.0 0.0
565443 BK 5674 15 1920.0 2.0 1920.0 0.0
565444 BK 5674 34 1924.0 2.0 1920.0 0.0
565445 BK 5674 49 1920.0 2.0 1920.0 0.0
565446 BK 5674 51 1920.0 2.0 1920.0 0.0
In [34]:
fig = px.scatter(ny0_df, x='yearbuilt', y='numfloors',
                 width=1500, height=1000, color='borough',
                 facet_col='floorgroup',  facet_col_wrap=3)

fig.update_yaxes(matches=None)
fig.for_each_yaxis(lambda yaxis: yaxis.update(showticklabels=True))

fig.show()

Part 2: Datashader¶

Datashader is a library from Anaconda that does away with the need to bin data before plotting. It takes in all of your data points and, based on the canvas and range, performs pixel-by-pixel calculations to come up with the best representation of the data.

As an example, let's continue with our question above and look at a 2D histogram of YearBuilt vs NumFloors:

In [13]:
ny1_df = ny[['borough','block','lot','yearbuilt','numfloors']].copy()
ny1_df.loc[ny1_df['numfloors']  >= 80,'numfloors'] = 80
ny1_df['decbuilt'] = (ny1_df.yearbuilt//10)*10
ny1_df['floorgroup'] = (ny1_df.numfloors//10)*10
ny1_df.head()
Out[13]:
borough block lot yearbuilt numfloors decbuilt floorgroup
1 BK 3435 45 1901.0 2.0 1900.0 0.0
2 BK 3447 29 1910.0 2.0 1910.0 0.0
3 BX 2514 10 1899.0 2.0 1890.0 0.0
4 MN 482 7501 1900.0 7.0 1900.0 0.0
5 BK 3434 8 1899.0 2.0 1890.0 0.0
In [14]:
fig = go.FigureWidget(
    data = [
        go.Histogram2d(x=ny1_df['decbuilt'], y=ny1_df['floorgroup'], autobiny=True, ybins={'size': 1}, colorscale='Greens')
    ]
)

fig
FigureWidget({
    'data': [{'autobiny': True,
              'colorscale': [[0.0, 'rgb(247,252,245)'], [0.125,…
In [15]:
fig = go.FigureWidget(
    data = [
        go.Histogram2d(x=ny['yearbuilt'], y=ny['numfloors'], autobiny=False, ybins={'size': 1}, colorscale='Greens')
    ]
)

fig
FigureWidget({
    'data': [{'autobiny': False,
              'colorscale': [[0.0, 'rgb(247,252,245)'], [0.125…

This shows us the distribution, but it's subject to some biases discussed in the Anaconda notebook Plotting Perils.

Here is what the same plot would look like in datashader:

In [16]:
#Defining some helper functions for DataShader
background = "black"
export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))

cvs = ds.Canvas(800, 500, x_range = (ny['yearbuilt'].min(), ny['yearbuilt'].max()), 
                                y_range = (ny['numfloors'].min(), ny['numfloors'].max()))
agg = cvs.points(ny, 'yearbuilt', 'numfloors')
view = tf.shade(agg, cmap = cm(Greys9), how='log')
export(tf.spread(view, px=2), 'yearvsnumfloors')
Out[16]:

That's technically just a scatterplot, but the points are smartly placed and colored to mimic what one gets in a heatmap. Depending on the pixel size, it will either display individual points or color the points of denser regions.

Datashader really shines when looking at geographic information. Here are the latitudes and longitudes of our dataset plotted out, giving us a map of the city colored by density of structures:

In [17]:
NewYorkCity   = (( 913164.0,  1067279.0), (120966.0, 272275.0))
cvs = ds.Canvas(700, 700, *NewYorkCity)
agg = cvs.points(ny, 'xcoord', 'ycoord')
view = tf.shade(agg, cmap = cm(inferno), how='log')
export(tf.spread(view, px=2), 'firery')
Out[17]:

Interestingly, since we're looking at structures, the large buildings of Manhattan show up as less dense on the map. The densest areas, measured by number of lots, are single- or multi-family townhomes.

Unfortunately, Datashader doesn't have the best documentation. Browse through the examples from their GitHub repo. I would focus on the visualization pipeline and the US Census example for the question below. Feel free to use my samples as templates when you work on this problem.

Question¶

You work for a real estate developer and are researching underbuilt areas of the city. After looking in the Pluto data dictionary, you've discovered that all tax assessments consist of two parts: The assessment of the land and assessment of the structure. You reason that there should be a correlation between these two values: more valuable land will have more valuable structures on them (more valuable in this case refers not just to a mansion vs a bungalow, but an apartment tower vs a single family home). Deviations from the norm could represent underbuilt or overbuilt areas of the city. You also recently read a really cool blog post about bivariate choropleth maps, and think the technique could be used for this problem.

Datashader is really cool, but it's not that great at labeling your visualization. Don't worry about providing a legend, but provide a quick explanation as to which areas of the city are overbuilt, which areas are underbuilt, and which areas are built in a way that's properly correlated with their land value.

Prepare Data¶

The first step in the analysis is resetting the data from the previous question. A reduced set of columns is selected from the ny data frame and copied to the ny_df data frame. The year-built and floors data are broken down into categories. The assessment data is decomposed into separate assessments for the land, the building, and the plot total.

note: there are 200+ rows without assessments. These rows are omitted since they do not add insight to the analysis.

In [18]:
ny_df = ny[['latitude','longitude','borough','block','lot','yearbuilt','numfloors','assessland','assesstot']].copy()

ny_df = ny_df[ny_df['assesstot'] > 0]

ny_df.loc[ny_df['numfloors']  >= 80,'numfloors'] = 80
ny_df['group_yearbuilt'] = (ny_df.yearbuilt//10)*10
ny_df['group_fnumfloors'] = (ny_df.numfloors//10)*10


ny_df['bmean_assesstot'] = ny_df.groupby('borough').assesstot.transform('mean')
ny_df['assessbuild'] = ny_df['assesstot'] - ny_df['assessland']

ny_df['prop_assesstot'] = ny_df['assesstot'] / ny_df['bmean_assesstot']
ny_df['prop_assessbuild'] = ny_df['assessbuild'] / ny_df['assesstot']
ny_df['prop_assessland'] = ny_df['assessland'] / ny_df['assesstot']



# Build Categories Choropleth Map.
pct = 100 / 3
labels_land = ['A', 'B', 'C']
labels_build = ['1', '2', '3']


bp_build = np.percentile(ny_df['assessbuild'], [pct, 100 - pct], axis=0)
bp_land = np.percentile(ny_df['assessland'], [pct, 100 - pct], axis=0)

ny_df['pt_assessbuild'] = pd.cut(ny_df['assessbuild'], [0, bp_build[0], bp_build[1], np.inf], right=False, labels=labels_build)
ny_df['pt_assessland'] = pd.cut(ny_df['assessland'], [0, bp_land[0], bp_land[1], np.inf], right=False, labels=labels_land)

ny_df['g_assess'] = pd.Categorical(ny_df['pt_assessland'].str.cat(ny_df['pt_assessbuild'], sep='-'))

ny_df.head(10)
Out[18]:
latitude longitude borough block lot yearbuilt numfloors assessland assesstot group_yearbuilt group_fnumfloors bmean_assesstot assessbuild prop_assesstot prop_assessbuild prop_assessland pt_assessbuild pt_assessland g_assess
1 40.688820 -73.906684 BK 3435 45 1901.0 2.0 12300.0 74940.0 1900.0 0.0 3.168240e+05 62640.0 0.236535 0.835869 0.164131 3 B B-3
2 40.688377 -73.905239 BK 3447 29 1910.0 2.0 11940.0 75780.0 1910.0 0.0 3.168240e+05 63840.0 0.239186 0.842439 0.157561 3 B B-3
3 40.834979 -73.927858 BX 2514 10 1899.0 2.0 14340.0 34560.0 1890.0 0.0 3.665232e+05 20220.0 0.094291 0.585069 0.414931 1 B B-1
4 40.721202 -73.997742 MN 482 7501 1900.0 7.0 1143002.0 6811650.0 1900.0 0.0 5.963870e+06 5668648.0 1.142153 0.832199 0.167801 3 C C-3
5 40.687098 -73.908828 BK 3434 8 1899.0 2.0 12000.0 70380.0 1890.0 0.0 3.168240e+05 58380.0 0.222142 0.829497 0.170503 2 B B-2
6 40.692845 -73.954582 BK 1763 7506 2006.0 6.0 279556.0 2405206.0 2000.0 0.0 3.168240e+05 2125650.0 7.591616 0.883770 0.116230 3 C C-3
7 40.686622 -73.907668 BK 3440 60 1910.0 2.0 20040.0 83160.0 1910.0 0.0 3.168240e+05 63120.0 0.262480 0.759019 0.240981 3 C C-3
8 40.687147 -73.907610 BK 3440 124 2004.0 3.0 13020.0 73320.0 2000.0 0.0 3.168240e+05 60300.0 0.231422 0.822422 0.177578 3 B B-3
9 40.686491 -73.907798 BK 3440 63 1910.0 2.0 11580.0 74100.0 1910.0 0.0 3.168240e+05 62520.0 0.233884 0.843725 0.156275 3 B B-3
10 40.686222 -73.907355 BK 3446 11 1910.0 2.0 11580.0 74400.0 1910.0 0.0 3.168240e+05 62820.0 0.234831 0.844355 0.155645 3 B B-3
In [19]:
ny_df.shape
Out[19]:
(811454, 19)
In [20]:
ny_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 811454 entries, 1 to 858618
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   latitude          811247 non-null  float64 
 1   longitude         811247 non-null  float64 
 2   borough           811454 non-null  object  
 3   block             811454 non-null  int64   
 4   lot               811454 non-null  int64   
 5   yearbuilt         811454 non-null  float64 
 6   numfloors         809254 non-null  float64 
 7   assessland        811454 non-null  float64 
 8   assesstot         811454 non-null  float64 
 9   group_yearbuilt   811454 non-null  float64 
 10  group_fnumfloors  809254 non-null  float64 
 11  bmean_assesstot   811454 non-null  float64 
 12  assessbuild       811454 non-null  float64 
 13  prop_assesstot    811454 non-null  float64 
 14  prop_assessbuild  811454 non-null  float64 
 15  prop_assessland   811454 non-null  float64 
 16  pt_assessbuild    811454 non-null  category
 17  pt_assessland     811454 non-null  category
 18  g_assess          811454 non-null  category
dtypes: category(3), float64(13), int64(2), object(1)
memory usage: 107.6+ MB

Sample Dataframe¶

The full data frame is a little large for graphing and analysis; therefore, a smaller sample is randomly selected from the broader data set. The expectation is that the smaller dataset will exhibit the same characteristics as the larger one. To increase the relevance of the analysis, we removed outliers.

In [21]:
np.random.seed(312)
msk = np.random.rand(len(ny_df)) < 0.05
s_df = ny_df[msk].copy()

# filter out some extreme cases
s_df = s_df[s_df['assessbuild'] < 10000000]

s_df.shape
Out[21]:
(40005, 19)

The five boroughs of New York City are distinct and exhibit very different profiles in the data set. The following graphs use the borough as a facet to better understand the differences and similarities across New York. For each graph, the y-axis is scaled based on the borough-specific data. The distribution of building value as a proportion of overall plot value is left-skewed, with a high proportion of properties attributing most of the plot assessment to the value of the building.

In [22]:
fig = px.scatter(s_df, x='prop_assessbuild', y='assesstot',
                facet_col='borough', color='borough',  facet_col_wrap=5)

fig.update_yaxes(matches=None)
fig.for_each_yaxis(lambda yaxis: yaxis.update(showticklabels=True))

fig.show()
In [23]:
fig = px.histogram(s_df, x="prop_assessbuild", color='borough')
fig.show()

The distribution profile for the land assessment relative to the total assessment of the property is right-skewed. There is a cluster of observations approaching zero, suggesting measurement errors.

In [24]:
fig = px.scatter(s_df, x='prop_assessland', y='assesstot',
                facet_col='borough', color='borough',facet_col_wrap=5)
fig.update_yaxes(matches=None)
fig.for_each_yaxis(lambda yaxis: yaxis.update(showticklabels=True))

fig.show()
In [25]:
fig = px.histogram(s_df, x="prop_assessland", color='borough')
fig.show()

The scatter plot of building assessment vs. total assessment highlights some differences across the boroughs. To adjust for some outliers in Manhattan, the y scale was limited to 3M for the total assessment.

In [26]:
fig = px.scatter(data_frame=s_df, 
                x='prop_assessbuild', y='assesstot',
                color="g_assess", 
                #size="floorgroup",
                width=1000, height=600, 
                title='Building Assessment', 
                range_y=[0,3000000]
            )

fig.show()
In [27]:
fig = px.scatter(data_frame=s_df, 
                x='prop_assessland', y='assesstot',
                color="g_assess", 
                #size="floorgroup",
                width=1000, height=600, 
                title='Land Assessment', 
                range_y=[0,3000000]
            )

fig.show()

Since I needed to determine the specific color for each coordinate pair, I separated the land and building values into three bins. Once I had the necessary breakpoints, I labeled each coordinate with the proper land and building value. Finally, with each value determined, I concatenated the values to produce the category (A-1 through C-3) that defines the proper color.

  • The underdeveloped category is C-1: higher land valuation with a low building valuation
  • The overdeveloped category is A-3: low-value land with a high building valuation
  • The bulk of the records fall in the equilibrium buckets of A-1, B-2, and C-3
In [28]:
fig = px.histogram(s_df, x="g_assess", color='borough')
fig.show()

The bivariate choropleth map captures the relationship between land and building assessments and highlights some key characteristics of New York City. The majority of the property assessments fall into the A-1, B-2, and C-3 buckets.

In [29]:
color_key = {'C-3':'#3b4994', 
             'B-3':'#8c62aa',
             'A-3':'#be64ac', 
             'C-2':'#5698b9', 
             'B-2':'#a5add3',
             'A-2':'#dfb0d6',
             'C-1':'#5ac8c8', 
             'B-1':'#ace4e4',
             'A-1':'#e8e8e8'}
In [30]:
background = "black"
export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))
In [31]:
NewYorkCity   = (( -74.29,  -73.69), (40.49, 40.92))
cvs = ds.Canvas(700, 700, *NewYorkCity)
agg = cvs.points(s_df,'longitude','latitude', ds.count_cat('g_assess'))
view = tf.shade(agg, color_key=color_key)
export(tf.spread(view, px=1), 'ny_df')
Out[31]:

Full Dataframe¶

Extending this analysis to the broader population produces similar results. The analysis highlights that NYC is a mature city with opportunities to take advantage of underdevelopment only at the margins.

In [32]:
fig = px.histogram(ny_df, x="g_assess", color='borough')
fig.show()

The resulting bivariate choropleth map suggests what one would suspect: the vast majority of the real estate in NYC has been developed in line with its relative value, and while there are opportunities, they are relatively few. The analysis is as follows:

  • The underdeveloped category is C-1 (green): higher land valuation with a low building valuation
  • The overdeveloped category is A-3 (pink): low-value land with a high building valuation
  • The bulk of the records fall in the equilibrium buckets of A-1, B-2, and C-3

I suspect that the overbuilt areas of the map are industrial in nature, and the less developed areas are likely too close to infrastructure that impacts the ability to fully develop tracts of land.

In [33]:
NewYorkCity   = (( -74.29,  -73.69), (40.49, 40.92))
cvs = ds.Canvas(700, 700, *NewYorkCity)
agg = cvs.points(ny_df,'longitude','latitude', ds.count_cat('g_assess'))
view = tf.shade(agg, color_key=color_key)
export(tf.spread(view, px=1), 'ny_df')
Out[33]: